Building a hash table from a csv file using yarn-cluster, and giving it to each executor

2014-11-13 Thread YaoPau
I built my Spark Streaming app on my local machine, and an initial step in
log processing is filtering out rows with spam IPs.  I use the following
code which works locally:

import scala.collection.mutable.HashSet

// Build a HashSet of bad IPs read in from a local CSV file
val badIpSource = scala.io.Source.fromFile("wrongIPlist.csv")
val ipLines = badIpSource.getLines()

val set = new HashSet[String]()
val badIpSet = set ++ ipLines
badIpSource.close()

// A row is good if its IP is not in the bad-IP set
def isGoodIp(ip: String): Boolean = !badIpSet.contains(ip)

But when I run this with --master yarn-cluster I get: Exception in thread
"Thread-4" java.lang.reflect.InvocationTargetException ... Caused by:
java.io.FileNotFoundException: wrongIPlist.csv (No such file or directory).
The file is there (I wasn't sure which directory it was reading from, so I
put it in both my current client directory and my HDFS home directory), so
now I'm wondering whether reading a file in parallel like this is simply not
allowed in general and that's why I'm getting the error.

I'd like each executor to have access to this HashSet (it's not a huge file,
about 3,000 IPs) instead of having to do a more expensive join. Any
recommendations on a better way to handle this?
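
For context, here's roughly how I use the set in the streaming job. This is
only a sketch: ssc, the socket source, and extractIp are stand-ins for my
real StreamingContext, input DStream, and log-parsing logic.

// Sketch only: ssc and the socket source stand in for the real input,
// and extractIp stands in for the real log parser.
def extractIp(line: String): String = line.split(" ")(0)
val logLines = ssc.socketTextStream("localhost", 9999)
val cleanLines = logLines.filter(line => isGoodIp(extractIp(line)))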



Re: Building a hash table from a csv file using yarn-cluster, and giving it to each executor

2014-11-13 Thread aappddeevv
If the file is not present on each node, it may not find it.
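
In yarn-cluster mode the driver itself runs on a cluster node, so a file that
only exists on the submitting machine won't be found there. A rough, untested
sketch of two common ways around this, reusing the names from the original
post (sc is the SparkContext, logLines the input DStream, extractIp the
parsing helper):

import org.apache.spark.SparkFiles

// Option 1: ship the file with the job, e.g.
//   spark-submit --master yarn-cluster --files wrongIPlist.csv ...
// and resolve its localized path wherever the file is read:
val path = SparkFiles.get("wrongIPlist.csv")
val badIpSet = scala.io.Source.fromFile(path).getLines().toSet

// Option 2: read the file once on the driver and broadcast the resulting
// set, so the executors never need the file at all:
val badIpBroadcast = sc.broadcast(badIpSet)
val cleanLines = logLines.filter(line => !badIpBroadcast.value.contains(extractIp(line)))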


