Wednesday, December 19, 2012

[Solved] Hadoop is processing "ghost data" from input

This fucking problem cost me more than 4 hours to solve. I have to note it down somewhere to take the edge off.

If you are using Linux (Ubuntu in my case)
If your MapReduce job is picking up input data that is not within your input folder
If you delete the contents of your input folder and MapReduce still reads "ghost data"*
If you have renamed your input folder and the problem still continues
If you have deleted the Hadoop temp directories and the problem still continues
If you have rebooted your machine and emptied your trash and the problem still continues

FFS make sure that the input directory is really empty! Text editors on Linux leave backup files behind that end with "~". These will not show up when you check the directory size or browse the folder contents in the file manager (backup files are hidden there by default), so list the directory from a terminal with ls -a to see what is really in it.
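If you prefer to do the check from code, here is a quick throwaway sketch (the input path is just a placeholder) that prints everything a local job would actually see, backup files included:

import java.io.File;

public class ListInputFiles {
    public static void main(String[] args) {
        // Placeholder path; point this at your MapReduce input folder.
        File inputDir = new File("/home/me/hadoop/input");
        File[] files = inputDir.listFiles();
        if (files == null) {
            System.out.println("Not a directory: " + inputDir);
            return;
        }
        for (File file : files) {
            // Editor backup files show up here even when the file manager hides them.
            System.out.println(file.getName() + " (" + file.length() + " bytes)");
        }
    }
}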

:|

*Ghost Data: Data that is not visible and seems not to be there.

Thursday, December 13, 2012

XML Writable (Serializing) and the InstantiationException

If you want to serialize (encode) non-bean objects in Java, you first have to give the class write() and readFields() methods so Java knows how the data in the object should be stored and re-created. In short, you need to make your object Writable in Hadoop terms.

Exception:

java.lang.InstantiationException:
(CLASS NAME YOU ARE TRYING TO SERIALIZE)
Continuing ...
java.lang.Exception: XMLEncoder: discarding statement ArrayList.add(MyObject);
Continuing ...
java.lang.InstantiationException:
(CLASS NAME YOU ARE TRYING TO SERIALIZE)
Continuing ...
java.lang.Exception: XMLEncoder: discarding statement ArrayList.add(MyObject);
Continuing ...

SITUATION BEFORE:

Object class (this is the class of the object you would like to serialize):

public class MyObject {
[VARIABLES]
[CONSTRUCTOR(S)]
[GETTERS&SETTERS]
}

Call from main (encode and print out):

public static void main(String[] args) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    XMLEncoder encoder = new XMLEncoder(buffer);
    encoder.writeObject(myObject); // the MyObject instance you want to serialize
    encoder.close();
    System.out.println(buffer.toString());
}


AFTER:

public class MyObject implements Writable {
[VARIABLES]
[CONSTRUCTORS]
[GETTERS&SETTERS]

    @Override
    public void write(DataOutput out) throws IOException {
        WritableUtils.writeString(out, variable1);
        WritableUtils.writeString(out, variable2);
        WritableUtils.writeString(out, variable3);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        variable1 = WritableUtils.readString(in);
        variable2 = WritableUtils.readString(in);
        variable3 = WritableUtils.readString(in);
    }

}

The call in main stays the same.

It is very important that the variables are written and read in the same order. If variable1 is written first in write(), it needs to be read first in readFields() too, so it can be put back correctly. Also, your class MUST have a no-args constructor.
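For reference, here is a minimal self-contained sketch of what the finished class could look like; the field names are made up for illustration:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableUtils;

public class MyObject implements Writable {

    private String title;
    private String author;
    private String content;

    // The no-args constructor is mandatory: Hadoop (and XMLEncoder)
    // first create an empty instance and then fill in the fields.
    public MyObject() {
    }

    public MyObject(String title, String author, String content) {
        this.title = title;
        this.author = author;
        this.content = content;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Fields are written in this order...
        WritableUtils.writeString(out, title);
        WritableUtils.writeString(out, author);
        WritableUtils.writeString(out, content);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // ...and must be read back in exactly the same order.
        title = WritableUtils.readString(in);
        author = WritableUtils.readString(in);
        content = WritableUtils.readString(in);
    }

    // Getters and setters (the bean conventions XMLEncoder relies on).
    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public String getAuthor() { return author; }
    public void setAuthor(String author) { this.author = author; }
    public String getContent() { return content; }
    public void setContent(String content) { this.content = content; }
}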

Monday, December 10, 2012

Solution: “utility classes should not have a public or default constructor”

“utility classes should not have a public or default constructor”

If a class only contains static methods, you probably want to call them directly rather than instantiating the class first. These kinds of classes are more like tools/utilities than blueprints for objects. Checkstyle warns you in this case because the class can still be instantiated. A little bit annoying for my liking ;)

Solution:

make the class final:
public final class MyClass {
    ...
}

and create an empty private constructor:

private MyClass() {
}

that should do the trick!
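
Putting both pieces together, a minimal sketch of such a utility class could look like this (the class and method names are just for illustration):

public final class MyUtils {

    // Private constructor: the class can no longer be instantiated.
    private MyUtils() {
    }

    // Example static helper, called directly as MyUtils.isBlank(...).
    public static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }
}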

Monday, December 3, 2012

NPE: org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73) (SOLVED)


My NullPointerException looked something like this:
java.lang.NullPointerException
    at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
    at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:959)
    at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:892)
    at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:393)
    at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:354)
    at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:476)
    at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.getRecordWriter(SequenceFileOutputFormat.java:61)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:569)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:638)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
After a few Google searches I found the solution. It turns out that you need to add the serialization libraries to the job configuration manually. So my conf setup looks like this:
Configuration conf = new Configuration();
conf.set("io.serializations",
        "org.apache.hadoop.io.serializer.JavaSerialization,"
        + "org.apache.hadoop.io.serializer.WritableSerialization");

I can't understand Chinese, but it's a good thing that programming code is (mostly) universal! :D

Link to the original thread: Link