Thanks Nick... still no luck.

If I use "?q=somerandomchars&fields=title,_source"

I get an exception about an empty collection, which seems to indicate it is
actually using the supplied es.query. But when I do rdd.take(1) or
take(10), all I get is the id and an empty dict. Maybe it has
something to do with how my index is set up in ES?

In [19]: es_rdd.take(4)
14/12/09 16:25:17 INFO SparkContext: Starting job: runJob at
14/12/09 16:25:17 INFO DAGScheduler: Got job 18 (runJob at
PythonRDD.scala:300) with 1 output partitions (allowLocal=true)
14/12/09 16:25:17 INFO DAGScheduler: Final stage: Stage 18(runJob at
14/12/09 16:25:17 INFO DAGScheduler: Parents of final stage: List()
14/12/09 16:25:17 INFO DAGScheduler: Missing parents: List()
14/12/09 16:25:17 INFO DAGScheduler: Submitting Stage 18 (PythonRDD[30] at
RDD at PythonRDD.scala:43), which has no missing parents
14/12/09 16:25:17 INFO MemoryStore: ensureFreeSpace(4776) called with
curMem=1979220, maxMem=278302556
14/12/09 16:25:17 INFO MemoryStore: Block broadcast_32 stored as values in
memory (estimated size 4.7 KB, free 263.5 MB)
14/12/09 16:25:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage
18 (PythonRDD[30] at RDD at PythonRDD.scala:43)
14/12/09 16:25:17 INFO TaskSchedulerImpl: Adding task set 18.0 with 1 tasks
14/12/09 16:25:17 INFO TaskSetManager: Starting task 0.0 in stage 18.0 (TID
19, localhost, ANY, 24823 bytes)
14/12/09 16:25:17 INFO Executor: Running task 0.0 in stage 18.0 (TID 19)
14/12/09 16:25:17 INFO NewHadoopRDD: Input split: ShardInputSplit
14/12/09 16:25:17 WARN EsInputFormat: Cannot determine task id...
14/12/09 16:25:17 INFO PythonRDD: Times: total = 289, boot = 5, init = 284,
finish = 0
14/12/09 16:25:17 ERROR NetworkClient: Node [Socket closed] failed (; selected next node []
14/12/09 16:25:17 INFO Executor: Finished task 0.0 in stage 18.0 (TID 19).
1886 bytes result sent to driver
14/12/09 16:25:17 INFO TaskSetManager: Finished task 0.0 in stage 18.0 (TID
19) in 316 ms on localhost (1/1)
14/12/09 16:25:17 INFO TaskSchedulerImpl: Removed TaskSet 18.0, whose tasks
have all completed, from pool
14/12/09 16:25:17 INFO DAGScheduler: Stage 18 (runJob at
PythonRDD.scala:300) finished in 0.324 s
14/12/09 16:25:17 INFO SparkContext: Job finished: runJob at
PythonRDD.scala:300, took 0.337848207 s
[(u'en_20040726_fbis_116728340038', {}),
 (u'en_20040726_fbis_116728320448', {}),
 (u'en_20040726_fbis_116728330192', {}),
 (u'en_20040726_fbis_116728330145', {})]
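
To rule out a malformed query string on my side, here is a minimal sketch of
how the conf could be assembled. The helper name build_es_conf is made up for
illustration, and it only builds the dict; whether es-hadoop actually honors
the "fields" parameter is exactly what I have not confirmed.

```python
from urllib.parse import urlencode


def build_es_conf(resource, nodes, query="*", fields=("title", "_source")):
    """Build a conf dict for sc.newAPIHadoopRDD (hypothetical helper)."""
    # safe="*," keeps the wildcard and the comma-separated field list
    # unescaped, so the result matches the hand-written query string.
    params = urlencode({"q": query, "fields": ",".join(fields)}, safe="*,")
    return {
        "es.resource": resource,
        "es.nodes": nodes,
        "es.query": "?" + params,
    }


conf = build_es_conf("en_2004/doc", "rap-es2.uis")
print(conf["es.query"])
```

The resulting dict would be passed as the conf argument to
sc.newAPIHadoopRDD, together with the same input format, key and value
classes as in the original call.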

In [20]:

On Tue, Dec 9, 2014 at 10:18 AM, Nick wrote:

> Try setting "es.query" to something like "?q=*&fields=title,_source" for a
> match-all query. You need the "q=*", which is the actual query part of the
> query string.
> On Tue, Dec 9, 2014 at 3:15 PM, Mohamed Lrhazi wrote:
>> Hello,
>> Following a couple of tutorials, I can't seem to get pyspark to return any
>> "fields" from ES other than the document id.
>> I tried it like so:
>> es_rdd = sc.newAPIHadoopRDD(
>>     inputFormatClass="", keyClass="", valueClass="",
>>     conf={"es.resource": "en_2004/doc",
>>           "es.nodes": "rap-es2.uis",
>>           "es.query": "?fields=title,_source"})
>> es_rdd.take(1)
>> Always shows:
>> Out[13]: [(u'en_20040726_fbis_116728340038', {})]
>> How does one get more fields?
>> Thanks,
>> Mohamed.

